
feat: add complete automation infrastructure for plot generation #4

Merged
MarkusNeusinger merged 8 commits into main from
claude/github-issue-poc-plan-01QbVSDkhbSCB2w6qKJe59m5
Nov 24, 2025

Conversation

@MarkusNeusinger
Owner

Summary

Adds complete automation infrastructure for AI-powered plot generation:

  • ✅ Versioned spec template system (v1.0.0)
  • ✅ Code generation with self-review loop
  • ✅ GitHub Actions workflows (@claude triggers for visibility)
  • ✅ Multi-version testing (Python 3.10-3.13)
  • ✅ Preview generation and quality check workflows
  • ✅ Spec upgrade tools (structural + AI-powered)

Workflows

  1. spec-to-code.yml: Posts @claude comment when "approved" label added
  2. claude.yml: Executes generation task with Claude Code
  3. test-and-preview.yml: Runs tests and generates previews (GCS optional)
  4. quality-check.yml: AI evaluation of generated plots (requires GCS)

Testing Plan

Phase 1 (No GCS required):

  • Create test issue with scatter-basic-001 spec
  • Add "approved" label
  • Watch code generation in Claude Code Web
  • Verify PR creation

Phase 2 (After GCS setup):

  • Preview generation and upload
  • Quality evaluation with vision

Related issue: Initial automation infrastructure setup

- Create detailed 13-phase plan for first POC version
- Include manual workflow orchestration via Claude Code
- Cover matplotlib and seaborn implementations
- Define self-review loop, multi-version testing, and preview optimization
- Specify GCS upload, PostgreSQL metadata, API, and minimal frontend
- Provide code examples, scripts, and troubleshooting guide
- Estimate 4-5 hours total implementation time

This plan validates the end-to-end workflow before automating with GitHub Actions.

Core Components:
- Spec template (.template.md) for consistent plot specifications
- Example spec: scatter-basic-001.md
- Plot generator with Claude + versioned rules + self-review loop
- Three GitHub Actions workflows for full automation

Workflows:
1. spec-to-code.yml: Auto-generates code when issue gets 'approved' label
   - Extracts spec from issue
   - Generates matplotlib + seaborn implementations
   - Self-review loop (max 3 attempts)
   - Creates PR automatically

2. test-and-preview.yml: Tests code and generates preview images
   - Multi-version testing (Python 3.10-3.13)
   - Generates preview PNGs
   - Uploads to GCS
   - Comments on PR with preview links

3. quality-check.yml: AI quality evaluation with Claude Vision
   - Downloads previews from GCS
   - Evaluates against spec quality criteria
   - Scores each implementation (0-100)
   - Comments with detailed feedback
   - Adds labels (quality-approved or quality-check-failed)

This infrastructure enables:
- Complete automation from GitHub Issue → Code → Test → Preview → Quality
- Self-review and quality gates built-in
- No manual steps required for plot generation
- Production-ready code output

Next: Test with scatter-basic-001 issue
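The self-review loop described above (generate, review, retry up to three times, feeding review notes back into the next prompt) can be sketched as follows. This is an illustrative sketch, not the actual plot_generator.py code: `generate` and `review` stand in for the real Claude API calls, and the verdict dictionary shape is an assumption.

```python
def generate_with_self_review(generate, review, max_attempts: int = 3):
    """Generate code, ask the reviewer for feedback, retry until approved.

    generate(feedback) returns candidate code (feedback is None on the first try);
    review(code) returns a dict like {"approved": bool, "notes": str}.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        code = generate(feedback)      # prior review notes go into the next prompt
        verdict = review(code)
        if verdict["approved"]:
            return code, attempt
        feedback = verdict["notes"]    # carry reviewer feedback into the retry
    raise RuntimeError(f"Not approved after {max_attempts} attempts")
```

The key design point is that feedback flows forward: each retry sees why the previous attempt was rejected, so the loop converges instead of regenerating blindly.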
Add version tracking to spec files for maintainability and automated upgrades.

Changes:
- Add version marker to spec template (v1.0.0)
- Update scatter-basic-001.md with version info
- Create upgrade_specs.py script for automated spec upgrades
- Add specs/VERSIONING.md documentation
- Update plot_generator.py to check and display spec version

Benefits:
- Easy template evolution without breaking existing specs
- Automated migration when template improves
- Clear tracking of which template version each spec uses
- Version upgrade can be batch-processed

Usage:
  # Check what would change
  python automation/scripts/upgrade_specs.py --dry-run

  # Upgrade all specs to latest version
  python automation/scripts/upgrade_specs.py

  # Upgrade specific spec
  python automation/scripts/upgrade_specs.py --spec scatter-basic-001

This enables continuous improvement of the spec template while maintaining backward compatibility.
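A spec version check like the one upgrade_specs.py and plot_generator.py rely on might look like the sketch below. The HTML-comment marker format is an assumption for illustration; the real template's marker syntax may differ.

```python
import re

# Assumed marker form, e.g.:  <!-- spec-version: 1.0.0 -->
VERSION_RE = re.compile(r"^<!--\s*spec-version:\s*([\d.]+)\s*-->", re.MULTILINE)

def read_spec_version(spec_text: str, default: str = "0.0.0") -> str:
    """Return the template version declared in a spec file, or a default
    for pre-versioning specs that carry no marker."""
    match = VERSION_RE.search(spec_text)
    return match.group(1) if match else default
```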
Add Claude-powered upgrade system for intelligent spec improvements beyond
structural changes.

New Features:
- upgrade_specs_ai.py: AI-powered spec upgrader using Claude
  - Semantic improvements (better wording, clearer criteria)
  - Preserves spec ID and core intent
  - Automatic backup creation (.backup-{version} files)
  - Dry-run mode for preview
  - Single spec or batch upgrade

- specs/upgrades/: Directory for version-specific upgrade instructions
  - Custom instructions per version transition
  - Examples and before/after comparisons
  - Rationale for changes

- Updated VERSIONING.md with AI upgrade documentation

Why AI-Powered Upgrades?
- Structural changes (new sections) are easy with regex
- Semantic improvements (better wording, specificity) need understanding
- AI can reformulate quality criteria to be measurable
- AI can enhance parameter descriptions with types and ranges
- AI preserves intent while improving clarity

Usage:
  export ANTHROPIC_API_KEY=sk-ant-...

  # Preview changes
  python automation/scripts/upgrade_specs_ai.py --dry-run

  # Upgrade all specs (with backups)
  python automation/scripts/upgrade_specs_ai.py --version 1.0.0

  # Upgrade single spec
  python automation/scripts/upgrade_specs_ai.py --spec scatter-basic-001

Combined Strategy:
1. upgrade_specs.py - For structural changes (fast, no API calls)
2. upgrade_specs_ai.py - For semantic improvements (AI-powered)

This enables continuous spec quality improvement as we learn better
documentation patterns.

Replace direct Anthropic API calls with Claude Code Action for AI tasks.
This allows the workflows to use the existing CLAUDE_CODE_OAUTH_TOKEN
instead of requiring a separate ANTHROPIC_API_KEY.

Changes:

spec-to-code.yml:
- ✅ Trigger remains label-based ('approved' label on issue)
- ✅ Spec extraction remains unchanged
- ✅ Code generation now uses Claude Code Action
- ✅ Provides detailed prompt with rules and requirements
- ✅ Claude Code reads specs, rules, and generates implementations
- ✅ Self-review loop integrated into Claude Code execution
- ✅ Automatic commit and PR creation

quality-check.yml:
- ✅ Trigger remains workflow_run based
- ✅ Preview image download from GCS unchanged
- ✅ Quality evaluation now uses Claude Code Action
- ✅ Claude Code views images (Vision) and evaluates against specs
- ✅ Uses gh CLI to comment on PR and add labels
- ✅ Scores each implementation (0-100, ≥85 to pass)

Benefits:
- No separate API key needed
- Uses existing Claude Code Max subscription
- Integrated with Claude Code ecosystem
- Cleaner, more maintainable code
- Claude Code handles commits automatically
- Better error handling and retries

Hybrid Approach:
- Label triggers → Automation starts
- Claude Code → AI-powered tasks (generation, evaluation)
- Regular Actions → Infrastructure (tests, GCS upload, PR creation)

This enables full automation using only CLAUDE_CODE_OAUTH_TOKEN
and GCS_CREDENTIALS secrets.

Make automation workflows visible and trackable in Claude Code Web by using
@claude triggers instead of direct API calls. This allows you to see the
workflow progress live in your Claude Code Web session.

Changes:

1. spec-to-code.yml (Code Generation):
   - Simplified to just trigger @claude with instructions
   - GitHub Action posts @claude comment when 'approved' label is added
   - Claude Code (via claude.yml) picks up the task
   - User sees generation progress in Claude Code Web
   - Claude Code creates implementations, commits, and makes PR

2. quality-check.yml (Quality Evaluation):
   - Posts @claude comment with evaluation instructions
   - Claude Code views preview images and evaluates
   - User sees evaluation progress in Claude Code Web
   - Claude Code posts results and adds labels

3. claude.yml (Core):
   - Enhanced permissions: contents:write, pull-requests:write, issues:write
   - Allows Claude Code to commit and create PRs
   - Fetch full history for better context

Workflow Flow:
1. Issue gets 'approved' label
2. spec-to-code.yml posts @claude comment
3. claude.yml triggers (sees @claude)
4. Claude Code executes in Web (YOU SEE THIS LIVE!)
5. Claude Code commits implementations
6. Claude Code creates PR
7. test-and-preview.yml runs (regular CI)
8. quality-check.yml posts @claude comment
9. Claude Code evaluates (YOU SEE THIS LIVE!)
10. Claude Code posts quality report + labels

Benefits:
- 👁️ Full visibility in Claude Code Web
- 🎮 Can intervene if needed
- 📊 See progress in real-time
- 🔍 Better debugging when things go wrong
- 💬 Natural conversation-based workflow
- 🤖 Same AI throughout (Claude Code Max)
@claude
Contributor

claude bot commented Nov 24, 2025

Pull Request Review: Complete Automation Infrastructure

🎯 Overview

This PR adds comprehensive automation infrastructure for AI-powered plot generation. The approach is well-architected and aligns with the project's specification-first philosophy. Here's my detailed review:


✅ Strengths

Architecture & Design

  • Excellent workflow orchestration: The 4-workflow chain (spec-to-code → claude → test-and-preview → quality-check) is well thought out
  • Clean separation of concerns: Each workflow has a clear, focused purpose
  • Multi-version testing: Python 3.10-3.13 matrix testing ensures broad compatibility
  • Self-review loop: The plot_generator.py implements iterative refinement (max 3 attempts)
  • Versioned rules system: Enables A/B testing and rollback capabilities

Code Quality

  • Type hints: Proper use of Literal types and return type annotations in plot_generator.py
  • Error handling: Good use of try-except blocks with clear error messages
  • Documentation: Comprehensive docstrings throughout
  • Modular design: Clear separation between generation, validation, and file operations

🔴 Critical Issues

1. Security: API Key Exposure Risk (plot_generator.py:169-173)

Location: automation/generators/plot_generator.py:89-93

api_key = os.getenv("ANTHROPIC_API_KEY")
if not api_key:
    raise ValueError("ANTHROPIC_API_KEY environment variable not set")

Issue: While the code correctly uses environment variables, the model ID is hardcoded:

model="claude-sonnet-4-20250514"

Concerns:

  • This model ID may not exist or may be deprecated
  • Hardcoding prevents easy model switching
  • No API rate limiting or error handling for API calls

Recommendation:

model = os.getenv("CLAUDE_MODEL", "claude-sonnet-4-20250514")
# Add retry logic with exponential backoff
from anthropic import APIError, RateLimitError

try:
    response = client.messages.create(...)
except RateLimitError as e:
    ...  # handle rate limiting, e.g. back off and retry
except APIError as e:
    ...  # handle other API errors

2. Workflow Artifact Passing Issue (quality-check.yml:25-30)

Location: .github/workflows/quality-check.yml:25-30

- name: Download preview metadata
  uses: actions/download-artifact@v4
  with:
    name: preview-metadata
    run-id: ${{ github.event.workflow_run.id }}
    github-token: ${{ secrets.GITHUB_TOKEN }}

Issue: The workflow_run event has known limitations with artifact downloading across workflow boundaries. Artifacts from the triggering workflow may not be accessible.

Recommendation: Use a persistent storage mechanism (e.g., commit a file to the repo, or use GitHub cache) or restructure to use a single workflow with multiple jobs.

3. Code Injection Vulnerability (test-and-preview.yml:111)

Location: .github/workflows/test-and-preview.yml:111

MPLBACKEND=Agg uv run python "$file" 2>&1 | tee "preview_outputs/${SPEC_ID}_${LIBRARY}_${VARIANT}.log"

Issue: Directly executing Python files from user-submitted PRs without sandboxing or validation is a major security risk. Malicious code could:

  • Exfiltrate secrets
  • Modify repository contents
  • Access GitHub Actions environment

Critical Recommendation:

  1. Run in a sandboxed environment (Docker container with no network access)
  2. Use resource limits (CPU, memory, time)
  3. Only execute from trusted branches after review
  4. Consider using PyPy or RestrictedPython
- name: Generate preview images
  run: |
    docker run --rm --network none --cpus="1.0" --memory="512m" \
      -v $(pwd):/workspace -w /workspace \
      python:3.12-slim bash -c "MPLBACKEND=Agg python $file"

⚠️ High Priority Issues

4. Fragile File Path Parsing (test-and-preview.yml:103-105, quality-check.yml:86)

LIBRARY=$(echo "$file" | cut -d'/' -f2)
SPEC_ID=$(echo "$file" | cut -d'/' -f4)
VARIANT=$(basename "$file" .py)

Issue: Hardcoded field positions will break if directory structure changes. No validation that the path matches the expected format.

Recommendation:

# Use regex with validation
if [[ $file =~ ^plots/([^/]+)/([^/]+)/([^/]+)/([^/]+)\.py$ ]]; then
  LIBRARY="${BASH_REMATCH[1]}"
  PLOT_TYPE="${BASH_REMATCH[2]}"
  SPEC_ID="${BASH_REMATCH[3]}"
  VARIANT="${BASH_REMATCH[4]}"
else
  echo "Invalid file path format: $file"
  exit 1
fi

5. Missing Error Handling (plot_generator.py:177-181)

if "```python" in code:
    code = code.split("```python")[1].split("```")[0].strip()

Issues:

  • No validation that code was successfully extracted
  • Will fail silently if format is unexpected
  • No syntax validation before saving

Recommendation:

import ast

def extract_code_from_response(text: str) -> str:
    if "```python" in text:
        code = text.split("```python")[1].split("```")[0].strip()
    else:
        code = text.strip()
    
    # Validate syntax
    try:
        ast.parse(code)
    except SyntaxError as e:
        raise ValueError(f"Generated code has syntax errors: {e}")
    
    return code

6. Hardcoded GCS Assumption (test-and-preview.yml:128-140)

The workflow assumes GCS is always available. The if: steps.changed_plots.outputs.changed_files != '' condition doesn't account for missing GCS credentials.

Recommendation:

- name: Setup Google Cloud authentication
  if: steps.changed_plots.outputs.changed_files != '' && secrets.GCS_CREDENTIALS != ''
  continue-on-error: true  # Don't fail if GCS unavailable
  id: gcs_auth
  uses: google-github-actions/auth@v2
  with:
    credentials_json: ${{ secrets.GCS_CREDENTIALS }}

- name: Upload previews to GCS
  if: steps.gcs_auth.outcome == 'success'
  # ...

7. Spec ID Extraction Fragility (spec-to-code.yml:22-27)

SPEC_ID=$(echo "$ISSUE_TITLE" | grep -oP '^[a-z]+-[a-z]+-\d{3}' || echo "")

Issues:

  • Only allows lowercase letters (fails for Scatter-basic-001)
  • Requires exactly 3 digits (fails for scatter-basic-1 or scatter-basic-1234)
  • No validation of spec file existence

Recommendation:

SPEC_ID=$(echo "$ISSUE_TITLE" | grep -oiP '^[a-z]+-[a-z]+-\d{3,4}' | tr '[:upper:]' '[:lower:]' || echo "")

if [ -n "$SPEC_ID" ] && [ ! -f "specs/${SPEC_ID}.md" ]; then
  echo "⚠️  Warning: Spec file specs/${SPEC_ID}.md does not exist"
fi

🟡 Medium Priority Issues

8. Incomplete Test Coverage Path

- name: Run tests
  run: |
    uv run pytest tests/ -v --cov=plots --cov-report=xml --cov-report=term

Issue: Coverage is only measured for plots/ directory, but the PR adds code in automation/ that should also be tested.

Recommendation:

uv run pytest tests/ -v --cov=plots --cov=automation --cov-report=xml --cov-report=term

9. Magic Numbers & Hardcoded Values

  • max_tokens=4000 (line 171) - Why 4000? Should be configurable
  • max_tokens=2000 (line 219) - Different for review?
  • max_attempts: int = 3 - Should be configurable via environment variable
  • Plot type mapping (lines 100-104) - Incomplete and will fail for new libraries
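One way to make these values configurable, as the review suggests, is a small environment-variable helper. This is a sketch under stated assumptions: the PLOTGEN_* variable names are hypothetical, not from the actual codebase.

```python
import os

def int_env(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    return int(os.getenv(name, default))

# Hypothetical names; the current hardcoded values become the defaults.
MAX_TOKENS_GENERATION = int_env("PLOTGEN_MAX_TOKENS", 4000)
MAX_TOKENS_REVIEW = int_env("PLOTGEN_REVIEW_MAX_TOKENS", 2000)
MAX_ATTEMPTS = int_env("PLOTGEN_MAX_ATTEMPTS", 3)
```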

10. Missing Logging

The plot_generator.py uses print statements instead of proper logging:

print(f"📋 Spec version: {spec_version}")
print(f"🔄 Attempt {attempt}/{max_attempts}")

Recommendation: Use Python's logging module for structured logging with levels (DEBUG, INFO, WARNING, ERROR).

11. Preview File Name Assumptions (test-and-preview.yml:114-125)

if [ -f "test_output.png" ]; then
  mv test_output.png "$OUTPUT_FILE"
elif [ -f "output.png" ]; then
  mv output.png "$OUTPUT_FILE"
elif [ -f "${SPEC_ID}.png" ]; then
  mv "${SPEC_ID}.png" "$OUTPUT_FILE"

Issue: This assumes plot scripts save to these specific filenames. Better to enforce a convention via generation rules.

Recommendation: Require all plot scripts to save to a predictable location specified in an environment variable.


💡 Suggestions & Best Practices

12. Add Integration Tests

The PR adds workflows but no tests for them. Consider:

  • Testing plot_generator.py with mock API responses
  • Testing spec upgrade scripts
  • Testing path parsing logic in a unit test

13. Add Rate Limiting Protection

import time
from functools import wraps

def rate_limit(calls_per_minute: int):
    min_interval = 60.0 / calls_per_minute
    last_called = [0.0]
    
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            if elapsed < min_interval:
                time.sleep(min_interval - elapsed)
            result = func(*args, **kwargs)
            last_called[0] = time.time()
            return result
        return wrapper
    return decorator

@rate_limit(calls_per_minute=5)
def call_claude_api(...):
    ...

14. Add Telemetry

Track success/failure rates, generation times, review attempts, etc. to monitor system health.

15. Version String Comparison Issue (upgrade_specs.py:76)

if current_version < "1.0.0" and target_version >= "1.0.0":

Issue: String comparison doesn't work correctly for semantic versions ("1.10.0" < "1.2.0" as strings).

Recommendation: Use packaging.version.Version for proper semantic version comparison.
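The pitfall and a dependency-free fix can be shown in a few lines. The review's actual recommendation is packaging.version.Version (which also handles pre-releases); the tuple sketch below covers only plain x.y.z versions.

```python
def version_key(version: str) -> tuple:
    """Turn '1.10.0' into (1, 10, 0) so comparison is numeric per component."""
    return tuple(int(part) for part in version.split("."))

# String comparison is lexicographic and gets semantic ordering wrong:
assert ("1.10.0" < "1.2.0") is True          # wrong semantically
# Numeric tuples order correctly:
assert version_key("1.10.0") > version_key("1.2.0")
```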

16. Add Dry-Run Mode to Workflows

Allow testing workflow logic without actual execution (e.g., workflow_dispatch with input parameter).


📋 Checklist for Merge

Must Fix Before Merge:

Should Fix Before Merge:

Nice to Have:

  • Add integration tests for plot_generator.py
  • Replace print statements with logging
  • Add rate limiting for API calls
  • Make configuration values (max_tokens, max_attempts) configurable

🎯 Overall Assessment

Quality Score: 75/100

Verdict: ⚠️ NEEDS IMPROVEMENTS before merge

Breakdown:

  • Architecture: ⭐⭐⭐⭐⭐ (5/5) - Excellent design
  • Security: ⭐⭐ (2/5) - Critical code execution vulnerability
  • Code Quality: ⭐⭐⭐⭐ (4/5) - Good but needs error handling
  • Testing: ⭐⭐⭐ (3/5) - Missing tests for new code
  • Documentation: ⭐⭐⭐⭐ (4/5) - Well documented

Summary

This is a solid foundation with excellent architectural design. The workflow orchestration and self-review loop are particularly well thought out. However, the critical security issue with executing untrusted code must be addressed before merging. The other issues around error handling, path parsing, and artifact passing should also be fixed for production readiness.

The code demonstrates strong understanding of the problem domain and follows project conventions well. With the security fixes and improved error handling, this will be a valuable addition to the project.


🚀 Recommendations

  1. Immediate: Address security issue #3 (sandboxing)
  2. Short term: Fix critical path parsing and error handling issues
  3. Before production: Add comprehensive integration tests and monitoring
  4. Future: Consider adding metrics/observability for generation quality tracking

Great work on the automation infrastructure! Let me know if you'd like me to help with any of the fixes.

@claude
Contributor

claude bot commented Nov 24, 2025

Pull Request Review: Complete Automation Infrastructure

🎯 Overview

This PR adds comprehensive automation infrastructure for AI-powered plot generation. The approach is well-architected and aligns with the project's specification-first philosophy. Here's my detailed review:


✅ Strengths

Architecture & Design

  • Excellent workflow orchestration: The 4-workflow chain (spec-to-code → claude → test-and-preview → quality-check) is well thought out
  • Clean separation of concerns: Each workflow has a clear, focused purpose
  • Multi-version testing: Python 3.10-3.13 matrix testing ensures broad compatibility
  • Self-review loop: The plot_generator.py implements iterative refinement (max 3 attempts)
  • Versioned rules system: Enables A/B testing and rollback capabilities

Code Quality

  • Type hints: Proper use of Literal types and return type annotations in plot_generator.py
  • Error handling: Good use of try-except blocks with clear error messages
  • Documentation: Comprehensive docstrings throughout
  • Modular design: Clear separation between generation, validation, and file operations

🔴 Critical Issues

1. Security: API Key Exposure Risk (plot_generator.py:169-173)

Location: automation/generators/plot_generator.py:89-93

While the code correctly uses environment variables, the model ID is hardcoded. This prevents easy model switching and has no API rate limiting or error handling for API calls.

Recommendation: Make model configurable and add retry logic with exponential backoff for rate limiting and API errors.

2. Workflow Artifact Passing Issue (quality-check.yml:25-30)

Location: .github/workflows/quality-check.yml:25-30

The workflow_run event has known limitations with artifact downloading across workflow boundaries. Artifacts from the triggering workflow may not be accessible.

Recommendation: Use a persistent storage mechanism (e.g., commit a file to the repo, or use GitHub cache) or restructure to use a single workflow with multiple jobs.

3. Code Injection Vulnerability (test-and-preview.yml:111)

Location: .github/workflows/test-and-preview.yml:111

MPLBACKEND=Agg uv run python "$file" 2>&1 | tee "preview_outputs/${SPEC_ID}_${LIBRARY}_${VARIANT}.log"

Issue: Directly executing Python files from user-submitted PRs without sandboxing or validation is a major security risk. Malicious code could:

  • Exfiltrate secrets
  • Modify repository contents
  • Access GitHub Actions environment

Critical Recommendation: Run in a sandboxed environment (Docker container with no network access) with resource limits (CPU, memory, time), and only execute from trusted branches after review.


⚠️ High Priority Issues

4. Fragile File Path Parsing (test-and-preview.yml:103-105, quality-check.yml:86)

LIBRARY=$(echo "$file" | cut -d'/' -f2)
SPEC_ID=$(echo "$file" | cut -d'/' -f4)
VARIANT=$(basename "$file" .py)

Issue: Hardcoded field positions will break if directory structure changes. No validation that the path matches the expected format.

Recommendation: Use regex with validation to ensure path matches expected format before parsing.

5. Missing Error Handling (plot_generator.py:177-181)

Code extraction from Claude responses has no validation that code was successfully extracted, will fail silently if format is unexpected, and has no syntax validation before saving.

Recommendation: Add Python AST parsing to validate syntax before saving generated code.

6. Hardcoded GCS Assumption (test-and-preview.yml:128-140)

The workflow assumes GCS is always available. The condition doesn't account for missing GCS credentials.

Recommendation: Add proper conditional logic to gracefully handle missing GCS credentials with continue-on-error: true.

7. Spec ID Extraction Fragility (spec-to-code.yml:22-27)

Issues:

  • Only allows lowercase letters (fails for Scatter-basic-001)
  • Requires exactly 3 digits (fails for scatter-basic-1 or scatter-basic-1234)
  • No validation of spec file existence

Recommendation: Use case-insensitive regex and validate spec file exists before proceeding.


🟡 Medium Priority Issues

8. Incomplete Test Coverage Path

Coverage is only measured for plots/ directory, but the PR adds code in automation/ that should also be tested.

Recommendation: Add --cov=automation to pytest command.

9. Magic Numbers & Hardcoded Values

  • max_tokens=4000 (line 171) - Why 4000? Should be configurable
  • max_tokens=2000 (line 219) - Different for review?
  • max_attempts: int = 3 - Should be configurable via environment variable
  • Plot type mapping (lines 100-104) - Incomplete and will fail for new libraries

10. Missing Logging

The plot_generator.py uses print statements instead of proper logging. Use Python's logging module for structured logging with levels (DEBUG, INFO, WARNING, ERROR).

11. Preview File Name Assumptions (test-and-preview.yml:114-125)

Assumes plot scripts save to specific filenames. Better to enforce a convention via generation rules and use environment variables.


💡 Suggestions & Best Practices

12. Add Integration Tests

The PR adds workflows but no tests for them. Consider:

  • Testing plot_generator.py with mock API responses
  • Testing spec upgrade scripts
  • Testing path parsing logic in a unit test

13. Add Rate Limiting Protection

Implement rate limiting decorator for Claude API calls to prevent hitting rate limits.

14. Add Telemetry

Track success/failure rates, generation times, review attempts, etc. to monitor system health.

15. Version String Comparison Issue (upgrade_specs.py:76)

if current_version < "1.0.0" and target_version >= "1.0.0":

Issue: String comparison doesn't work correctly for semantic versions ("1.10.0" < "1.2.0" as strings).

Recommendation: Use packaging.version.Version for proper semantic version comparison.

16. Add Dry-Run Mode to Workflows

Allow testing workflow logic without actual execution (e.g., workflow_dispatch with input parameter).


📋 Checklist for Merge

Must Fix Before Merge:

Should Fix Before Merge:

Nice to Have:

  • Add integration tests for plot_generator.py
  • Replace print statements with logging
  • Add rate limiting for API calls
  • Make configuration values (max_tokens, max_attempts) configurable

🎯 Overall Assessment

Quality Score: 75/100

Verdict: ⚠️ NEEDS IMPROVEMENTS before merge

Breakdown:

  • Architecture: ⭐⭐⭐⭐⭐ (5/5) - Excellent design
  • Security: ⭐⭐ (2/5) - Critical code execution vulnerability
  • Code Quality: ⭐⭐⭐⭐ (4/5) - Good but needs error handling
  • Testing: ⭐⭐⭐ (3/5) - Missing tests for new code
  • Documentation: ⭐⭐⭐⭐ (4/5) - Well documented

Summary

This is a solid foundation with excellent architectural design. The workflow orchestration and self-review loop are particularly well thought out. However, the critical security issue with executing untrusted code must be addressed before merging. The other issues around error handling, path parsing, and artifact passing should also be fixed for production readiness.

The code demonstrates strong understanding of the problem domain and follows project conventions well. With the security fixes and improved error handling, this will be a valuable addition to the project.


🚀 Recommendations

  1. Immediate: Address security issue #3 (sandboxing)
  2. Short term: Fix critical path parsing and error handling issues
  3. Before production: Add comprehensive integration tests and monitoring
  4. Future: Consider adding metrics/observability for generation quality tracking

Great work on the automation infrastructure! The architecture is sound and aligns well with the project's goals. Focus on addressing the security concerns and error handling, and this will be production-ready.

@MarkusNeusinger MarkusNeusinger self-assigned this Nov 24, 2025
Addresses critical review feedback with 5 key improvements:

1. GCS Optional: Make GCS upload optional with proper error handling
   - Add continue-on-error and conditional checks
   - Workflows work without GCS credentials

2. Spec ID Extraction: Case-insensitive, flexible digit count
   - Support 3-4 digits (001-9999)
   - Convert to lowercase automatically
   - Validate spec file existence

3. Code Syntax Validation: AST-based validation before saving
   - New extract_and_validate_code() function
   - Catches syntax errors early with retry
   - Clear error messages

4. Path Parsing: Regex-based validation with error handling
   - Replace fragile cut-based parsing
   - Validate format: plots/{lib}/{type}/{spec}/{variant}.py
   - Skip invalid files with warning

5. API Error Handling: Retry with exponential backoff
   - Retry on RateLimitError and APIConnectionError
   - 3 retries with 2s, 4s, 8s delays
   - Don't retry on other API errors

Ready for initial testing without GCS setup.
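The retry behaviour described in item 5 (3 retries, 2s/4s/8s delays, retrying only on rate-limit and connection errors) can be sketched as below. The real implementation lives in plot_generator.py; RetryableError is a stand-in for anthropic's RateLimitError / APIConnectionError, and the injectable sleep parameter is an addition for testability.

```python
import time

class RetryableError(Exception):
    """Stand-in for anthropic.RateLimitError / anthropic.APIConnectionError."""

def call_with_backoff(call, retries: int = 3, base_delay: float = 2.0,
                      sleep=time.sleep):
    """Invoke call(); on a retryable error, wait 2s, 4s, 8s before retrying.
    Non-retryable exceptions propagate immediately."""
    for attempt in range(retries + 1):
        try:
            return call()
        except RetryableError:
            if attempt == retries:
                raise                      # out of retries: surface the error
            sleep(base_delay * (2 ** attempt))   # exponential backoff
```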
@claude
Contributor

claude bot commented Nov 24, 2025

🔍 Code Review

Thank you for this comprehensive PR adding automation infrastructure for plot generation! This is a well-structured implementation with good separation of concerns. Below is my detailed review:


✅ Strengths

  1. Excellent Architecture

    • Clean separation between spec generation, code generation, testing, and quality checks
    • Well-designed workflow chaining (spec-to-code → test-and-preview → quality-check)
    • Proper use of versioned rules system for maintainability
  2. Good Error Handling

    • Retry logic with exponential backoff in plot_generator.py:57-101
    • Proper error messages and exit codes
    • Validation of API keys and file paths
  3. Security Best Practices

    • Secrets properly managed via GitHub Secrets (not hardcoded)
    • GCS credentials handled via OIDC with id-token: write
    • Appropriate use of continue-on-error for optional GCS upload

🐛 Potential Bugs & Issues

High Priority

  1. Spec ID Extraction May Miss Valid IDs (spec-to-code.yml:16-44)

    • Current regex: [a-z]+-[a-z]+-\d{3,4} only allows lowercase
    • But the pattern allows "3-4 digits" which could match IDs like scatter-basic-0001
    • Issue: Inconsistent with docs stating {001-9999} format
    • Recommendation: Standardize to exactly 3 digits: [a-z]+-[a-z]+-\d{3}
  2. Checkout Missing in spec-to-code.yml

    • The workflow extracts spec ID and validates file existence (line 38) but never checks out the repository
    • Fix: Add checkout step before the extract step:
    - name: Checkout repository
      uses: actions/checkout@v4
  3. Unsafe Variable Reference in Quality Check (quality-check.yml:64-66)

    • JavaScript template literal uses unescaped variable in string interpolation
    • While unlikely to be exploited here, it's better to use proper escaping
    • Recommendation: Pass as environment variables instead
  4. Code Extraction Logic May Fail (plot_generator.py:37-40)

    • Splits on triple backticks which could fail if there are multiple code blocks
    • Better approach: Use more robust parsing with proper indexing

Medium Priority

  1. Preview Image Name Parsing Bug (test-and-preview.yml:164-167)

    • cut -d'_' -f1-3 assumes spec_id has exactly 3 parts separated by dashes
    • For scatter-basic-001_matplotlib_default.png, this would produce scatter-basic-001
    • But if spec IDs ever have more dashes, this breaks
    • Recommendation: Parse filename using bash regex matching instead
  2. Missing Error Handling for jq Failures (test-and-preview.yml:88)

    • If jq fails, the while loop silently continues
    • Recommendation: Add set -e at start of bash scripts or check jq exit code
  3. Hardcoded Model Name (plot_generator.py:256, 309)

    • Model claude-sonnet-4-20250514 is hardcoded
    • Recommendation: Make configurable via environment variable with this as default

Low Priority

  1. Version String Comparison May Fail (upgrade_specs.py:76)

    • Uses string comparison current_version < "1.0.0"
    • Works for simple versions but breaks for "1.10.0" < "1.9.0" (string comparison)
    • Recommendation: Use proper semantic versioning library like packaging.version.parse()
  2. Potential Race Condition in Preview Generation (test-and-preview.yml:121-132)

    • Checks for hardcoded filenames (test_output.png, output.png, etc.)
    • If multiple previews run in parallel (they don't currently), they could collide
    • Not a current issue since it's sequential, but worth noting

🚀 Performance Considerations

  1. Workflow Chaining Overhead

    • quality-check.yml waits for test-and-preview.yml to fully complete
    • Consider using workflow artifacts more efficiently or running quality checks in parallel with tests
  2. Multi-Python Testing

    • Testing on 4 Python versions (3.10-3.13) is thorough but adds ~4x time
    • Optimization: Consider testing 3.10 and 3.13 only on PRs, full matrix on main branch
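An untested sketch of one way to express that split, using the `&&`/`||` ternary idiom in a GitHub Actions expression (full matrix on main, endpoints only on everything else):

```yaml
strategy:
  matrix:
    # Full Python matrix on main, oldest + newest only on PR builds
    python-version: ${{ github.ref == 'refs/heads/main' && fromJSON('["3.10", "3.11", "3.12", "3.13"]') || fromJSON('["3.10", "3.13"]') }}
```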
  3. API Token Usage

    • Each generation attempt uses 2 API calls (generation + self-review)
    • With a max of 3 attempts, that is up to 6 API calls per implementation
    • Cost consideration: Monitor token usage; consider reducing max_attempts to 2
  4. GCS Download/Upload

    • Downloads entire preview directory with gsutil -m cp -r
    • For many files, this could be slow
    • Current scale: Not an issue yet, but consider pagination if >100 files

🔒 Security Concerns

  1. Command Injection Risk (Low) (test-and-preview.yml:98-132)

    • Uses the $file variable from git output in bash commands
    • While git output is trusted, it's good practice to quote such variables
    • Status: The invocation already quotes it (uv run python "$file" ✅), so no change is needed
  2. Secrets Exposure in Logs

    • GCS_BUCKET is used in URLs (quality-check.yml:65)
    • If bucket name is sensitive, this exposes it in comments
    • Recommendation: Document that bucket name is not considered secret
  3. Untrusted Code Execution (test-and-preview.yml:118)

    • Executes generated Python code with uv run python "$file"
    • This is by design but worth noting: malicious specs could generate harmful code
    • Mitigation: Run in isolated GH Actions environment (✅ already done)

🧪 Test Coverage

Missing Tests

  1. No Unit Tests for plot_generator.py

    • Critical component with no test coverage
    • Recommendation: Add tests in tests/unit/automation/test_plot_generator.py:
      • Test extract_and_validate_code() with various inputs
      • Mock API calls to test retry logic
      • Test error handling for missing files
  2. No Tests for Upgrade Scripts

    • upgrade_specs.py and upgrade_specs_ai.py lack tests
    • Recommendation: Test version detection, upgrade logic, backup creation
  3. No Workflow Tests

    • Consider adding workflow validation with actionlint
    • Check for common YAML/workflow issues

Test Structure Recommendations

````python
# tests/unit/automation/generators/test_plot_generator.py
import pytest

# Adjust this import to the actual module path of plot_generator.py
# from automation.generators.plot_generator import extract_and_validate_code


def test_extract_and_validate_code_success():
    response = '''Here's the code:
    ```python
    def create_plot():
        pass
    ```'''
    code = extract_and_validate_code(response)
    assert "def create_plot" in code


def test_extract_and_validate_code_syntax_error():
    with pytest.raises(ValueError, match="syntax errors"):
        extract_and_validate_code("def invalid syntax")
````
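The retry logic can be tested the same way by mocking the API client. The harness below is a self-contained stand-in for the generator's loop (the real signature and client attributes in plot_generator.py may differ): the first mocked response is unusable, the second is valid, and the test asserts the loop recovers after exactly two calls.

```python
from unittest.mock import MagicMock


def generate_with_retry(client, extract, max_attempts=3):
    """Stand-in for plot_generator's retry loop (real signature may differ)."""
    last_error = None
    for _ in range(max_attempts):
        response = client.messages.create(model="claude-sonnet-4-20250514", messages=[])
        try:
            return extract(response.content[0].text)
        except ValueError as err:
            last_error = err
    raise RuntimeError(f"all {max_attempts} attempts failed: {last_error}")


def test_retry_recovers_after_bad_response():
    bad = MagicMock(content=[MagicMock(text="no code block here")])
    good = MagicMock(content=[MagicMock(text="def create_plot(): pass")])
    client = MagicMock()
    client.messages.create.side_effect = [bad, good]

    def extract(text):
        # Mimics extract_and_validate_code raising on unusable responses
        if "def create_plot" not in text:
            raise ValueError("syntax errors")
        return text

    assert "def create_plot" in generate_with_retry(client, extract)
    assert client.messages.create.call_count == 2
```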

📝 Code Quality Suggestions

  1. Add Type Hints to upgrade_specs.py

    • Line 76: the current_version < "1.0.0" comparison lacks type safety
    • Import Optional from typing for values that may be None
  2. Improve Docstrings

    • plot_generator.py has good docstrings ✅
    • upgrade_specs.py could use more detailed examples in docstrings
  3. Extract Magic Strings

    • Multiple workflows reference "claude-sonnet-4-20250514"
    • Consider centralizing in a config file or environment variable
  4. Consider Adding Logging

    • plot_generator.py uses print() statements
    • For production, consider using Python logging module for better control
  5. Shell Script Best Practices

    • Add set -euo pipefail at the start of bash blocks for better error handling
    • Example:
```yaml
run: |
  set -euo pipefail
  # rest of script
```

🎯 Recommendations Summary

Must Fix Before Merge

  1. ✅ Add checkout step to spec-to-code.yml
  2. ✅ Fix preview image name parsing in test-and-preview.yml:164-167
  3. ✅ Add basic unit tests for plot_generator.py

Should Fix Soon

  1. Standardize spec ID format (3 vs 4 digits)
  2. Improve version string comparison in upgrade scripts
  3. Add proper error handling for jq failures

Nice to Have

  1. Add workflow validation with actionlint
  2. Consider making model name configurable
  3. Add logging instead of print statements
  4. Extract magic strings to config

📊 Overall Assessment

Code Quality: ⭐⭐⭐⭐☆ (4/5) - Well structured with good patterns, minor issues

Security: ⭐⭐⭐⭐⭐ (5/5) - Proper secret management, appropriate isolation

Test Coverage: ⭐⭐☆☆☆ (2/5) - Missing unit tests for critical components

Documentation: ⭐⭐⭐⭐☆ (4/5) - Good inline docs, comprehensive CLAUDE.md

Performance: ⭐⭐⭐⭐☆ (4/5) - Reasonable for current scale, some optimization opportunities


✅ Conclusion

This is a solid foundation for the automation infrastructure! The architecture is well-designed and follows GitHub Actions best practices. The main concern is the lack of unit tests for critical components like plot_generator.py.

Recommendation: After addressing the "Must Fix" items (especially the missing checkout step and adding basic tests), this PR will be ready to merge.

Great work on the comprehensive automation setup! 🎉

@MarkusNeusinger MarkusNeusinger merged commit 7ca4f26 into main Nov 24, 2025
2 of 6 checks passed
@MarkusNeusinger MarkusNeusinger deleted the claude/github-issue-poc-plan-01QbVSDkhbSCB2w6qKJe59m5 branch November 24, 2025 12:59
MarkusNeusinger added a commit that referenced this pull request Dec 1, 2025
…an-01QbVSDkhbSCB2w6qKJe59m5

feat: add complete automation infrastructure for plot generation
MarkusNeusinger added a commit that referenced this pull request Dec 16, 2025
#1089)

Fixes #967

## Summary
- Update quality threshold documentation from 85 to 90
- Add CLI retry logic for intermittent failures
- Change issue lifecycle: spec-ready issues stay open until all
implementations complete

## Changes

### Documentation (Issue Finding #4)
Updates quality threshold from 85 to 90 in documentation to match actual
workflow configuration in `impl-review.yml`.

**Files:**
- CLAUDE.md
- README.md
- docs/workflow.md
- docs/concepts/claude-skill-plot-generation.md
- prompts/quality-evaluator.md

### CLI Retry Logic (Issue Finding #3)
Adds retry mechanism for Claude CLI steps to handle intermittent
"Executable not found in $PATH" errors. Each Claude step now has
`continue-on-error: true` and a retry step that runs if the first
attempt fails.

**Files:**
- .github/workflows/spec-create.yml
- .github/workflows/spec-update.yml
- .github/workflows/impl-generate.yml
- .github/workflows/impl-repair.yml
- .github/workflows/util-claude.yml

### Issue Lifecycle (Bonus)
- `spec-ready` issues now stay open until all 9 library implementations
are merged
- Changed `Closes #...` to `Related to #...` in spec PR body
- Added auto-close logic in `impl-merge.yml` when all
`impl:{library}:done` labels are present

**Files:**
- .github/workflows/spec-create.yml
- .github/workflows/impl-merge.yml

## Test Plan
- [ ] Verify documentation shows correct 90 threshold
- [ ] Trigger a workflow and verify retry works on CLI failure
- [ ] Create a test spec and verify issue stays open after merge
- [ ] Verify issue closes when all 9 libraries have
`impl:{library}:done`
MarkusNeusinger added a commit that referenced this pull request Dec 17, 2025
- Fix #1: Handle missing remote branches in impl-generate.yml
  - Check if branch exists before checkout to avoid 'not a commit' errors
  - Fall back to creating fresh branch from main if remote doesn't exist

- Fix #2: Clean up duplicate labels on failure
  - Remove both 'generate:X' and 'impl:X:pending' when marking as failed
  - Prevents label accumulation (e.g., both pending and failed)

- Fix #3: Auto-close issues when done + failed = 9
  - Previously only closed when all 9 were 'done'
  - Now closes when total (done + failed) reaches 9
  - Shows which libraries failed in closing comment

- Fix #4: Track generation failures and auto-mark as failed
  - Count previous failed runs for same spec/library
  - After 3 failures, mark as 'impl:X:failed' automatically
  - Posts failure comment explaining the library may not support this plot type
MarkusNeusinger added a commit that referenced this pull request Dec 17, 2025
…1308)

## Summary

Fixes several workflow issues discovered during batch processing of
spec-ready issues.

### Fix #1: Branch-Not-Found Errors
**Problem:** `fatal: 'origin/implementation/{spec}/{library}' is not a
commit` errors when workflow tries to checkout a non-existent remote
branch.

**Solution:** Check if remote branch exists before checkout, fall back
to creating fresh branch from main.

### Fix #2: Duplicate Labels
**Problem:** Issues accumulate both `impl:X:pending` and `impl:X:failed`
labels when generation fails.

**Solution:** Failure handler removes both `generate:X` and
`impl:X:pending` when marking as failed.

### Fix #3: Auto-Close with Failures
**Problem:** Issues with 8 done + 1 failed stay OPEN because auto-close
only triggers on 9 done.

**Solution:** Close when `done + failed = 9`, with appropriate summary
(shows which libraries failed).

### Fix #4: Generation Failure Tracking
**Problem:** When `impl-generate` fails (no plot.png), no PR is created
→ no review → no repair → library stays `pending` forever.

**Solution:** Track generation failures and mark as `impl:X:failed`
after 3 consecutive failures. Posts comment explaining the library may
not support this plot type.

## Files Changed
- `.github/workflows/impl-generate.yml` - Fixes #1, #2, #4
- `.github/workflows/impl-merge.yml` - Fix #3

## Testing
- YAML syntax validated
- Logic reviewed against observed failure patterns
MarkusNeusinger added a commit that referenced this pull request Jan 26, 2026
Critical fixes:
- Fix TypeError: flatten tag dict to list for repository call (#6)
  Changed search_by_tags to receive list[str] instead of dict[str, list[str]]
  Repository expects flat list of tag values, not nested dict

- Add missing dataprep and styling parameters (#1, #7)
  Added to search_specs_by_tags function signature and docstring
  These categories were documented but not implemented

- Add filter logic for dataprep and styling (#2)
  Implemented filtering checks similar to other impl-level tags
  Ensures new parameters actually filter results

- Update condition to include dataprep and styling (#3)
  Modified impl-level filtering condition on line 117
  Now checks all 6 impl-level tag categories

Improvements:
- Add database error handling with helpful messages (#8)
  Check is_db_configured() before all database operations
  Provides clear error message if DATABASE_URL not set

- Update test mocks to match fixed interface (#5)
  Tests now verify flattened tag list instead of dict
  Added new test for dataprep/styling filter parameters
  Mock is_db_configured to return True in test fixture

Verification:
- All 16 unit tests passing
- Ruff linting and formatting applied
- No routing conflicts (#4 verified - no /mcp routes in routers)

Related: PR #4132, Issue #4129